
[Distributed Optimizer] Fix transpose creation when keep_fp8_weight_transpose_cache=False#501

Open
sudhu2k wants to merge 1 commit into dev from sudhu/distiboptim_fp8_transpose_cache_fix

Conversation

@sudhu2k
Contributor

@sudhu2k sudhu2k commented Mar 20, 2026

Summary

Fixes a bug where post_all_gather_processing created a transpose for Float8Tensor weights even when keep_fp8_weight_transpose_cache=False, leading to assertion failures in the Linear forward pass.

Problem

With keep_fp8_weight_transpose_cache=False, quantizer.columnwise_usage is set to False (e.g. on ROCm/AMD). post_all_gather_processing nevertheless created the transpose via _create_transpose(), because it did not check columnwise_usage. This triggered the assertion in the Linear forward pass, which expects _transpose to be None or an empty tensor when the transpose cache is disabled.

Solution

  • Update post_all_gather_processing in utils.py so it only creates the transpose when model_weight._quantizer.columnwise_usage is True.
  • Parametrize test_cast_master_weights_to_fp8 over keep_fp8_weight_transpose_cache=True and False to cover both cases.
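The guard described above can be illustrated with a minimal, self-contained sketch. The classes below (FakeQuantizer, FakeFloat8Weight) are hypothetical stand-ins for Transformer Engine's real Float8Tensor and quantizer; only the attributes relevant to this fix are modeled, and is_non_tn_fp8_gemm_supported is stubbed to mimic a backend that needs pre-created transposes.

```python
class FakeQuantizer:
    """Hypothetical stand-in for the real quantizer; only columnwise_usage matters here."""
    def __init__(self, columnwise_usage):
        self.columnwise_usage = columnwise_usage

class FakeFloat8Weight:
    """Hypothetical stand-in for a Float8Tensor model weight."""
    def __init__(self, columnwise_usage):
        self._quantizer = FakeQuantizer(columnwise_usage)
        self._transpose = None

    def _create_transpose(self):
        # Placeholder for the real FP8 transpose creation.
        self._transpose = "transposed-data"

def is_non_tn_fp8_gemm_supported():
    # Assumption for this sketch: a backend (e.g. ROCm) that requires
    # pre-created transposes for FP8 GEMM.
    return False

def post_all_gather_processing(model_weight):
    # The fix: only pre-create the transpose when the quantizer actually
    # wants column-wise (transposed) data.
    if model_weight._quantizer.columnwise_usage and not is_non_tn_fp8_gemm_supported():
        model_weight._create_transpose()

# With the cache disabled, columnwise_usage is False, so _transpose stays None
# and the assertion in Linear forward no longer fires.
w = FakeFloat8Weight(columnwise_usage=False)
post_all_gather_processing(w)
assert w._transpose is None
```

With columnwise_usage=True the sketch still creates the transpose, so the pre-fix behavior on CUDA is unchanged.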

Testing

  • test_cast_master_weights_to_fp8 with keep_fp8_weight_transpose_cache=False now passes.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

… add unit test when using distributed optimizers
@sudhu2k sudhu2k changed the title Fix transpose creation when keep_fp8_weight_transpose_cache=False and… [Distributed Optimizer] Fix transpose creation when keep_fp8_weight_transpose_cache=False Mar 20, 2026
@sudhu2k sudhu2k self-assigned this Mar 20, 2026
  # Delayed scaling and per-tensor current scaling: if backend does not support
  # non-transposed FP8 GEMM, pre-create the transpose.
- if not is_non_tn_fp8_gemm_supported():
+ if model_weight._quantizer.columnwise_usage and not is_non_tn_fp8_gemm_supported():
Collaborator
Please comment or guard the changes

  quantizations.append("fp8_block")

  manual_post_all_gather_processings = [False, True]
+ keep_fp8_weight_transpose_caches = [True, False]
Collaborator
It should only be True on CUDA, and better to name it keep_fp8_weight_transpose_cache to match the name of the parameter.
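The reviewer's suggestion could look roughly like the sketch below. It is a hypothetical reshaping of the parametrization, not the repo's actual test code: is_cuda_backend and run_case are stand-ins (the real test would query the build/device and run test_cast_master_weights_to_fp8's body), and the list uses the singular parameter name as suggested.

```python
def is_cuda_backend():
    # Assumption for this sketch: a ROCm build, where the transpose
    # cache can actually be disabled.
    return False

def cache_values_to_test():
    # On CUDA only True is valid; off CUDA, exercise both settings.
    return [True] if is_cuda_backend() else [True, False]

def run_case(keep_fp8_weight_transpose_cache):
    # Placeholder for the real test_cast_master_weights_to_fp8 body.
    return keep_fp8_weight_transpose_cache in (True, False)

for keep_fp8_weight_transpose_cache in cache_values_to_test():
    assert run_case(keep_fp8_weight_transpose_cache)
```

In a pytest suite the same list would feed @pytest.mark.parametrize("keep_fp8_weight_transpose_cache", ...).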
